White wine Exploration by TOMOYA KUBO

This report explores a dataset containing quality ratings and chemical attributes for approximately 4,900 white wines.

Univariate Plots Section

Abstract of data

## [1] 4898
## [1] 13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Our dataset consists of 13 variables, with almost 4,900 observations.

Quality of wine

The distribution of white wine quality looks approximately normal. Why is it normal? I wonder whether white wine quality is decided by chance. I also wonder what this plot looks like across the other attributes.

Fixed.acidity in wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The smallest fixed acidity is 3.8 and the largest is 14.2. Above, I plot the main body of the fixed acidity values. The distribution of this variable is almost normal. I wonder whether this variable affects wine quality, which would explain why wine quality is approximately normally distributed too. Fixed acidity also has some outliers.
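The text above mentions outliers without saying how they are defined. One common convention, assumed here rather than taken from the report, is Tukey's rule: a value counts as an outlier if it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the quartiles printed above. It is written in Python for illustration (the report itself was produced in R), and `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) bounds outside which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and extremes of fixed.acidity, copied from the summary above.
q1, q3 = 6.3, 7.3
minimum, maximum = 3.8, 14.2

lower, upper = tukey_fences(q1, q3)  # k=1.5 is the assumed convention
has_low_outliers = minimum < lower
has_high_outliers = maximum > upper

print(f"fences: [{lower:.2f}, {upper:.2f}]")
print("low outliers:", has_low_outliers, "| high outliers:", has_high_outliers)
```

Under this assumed rule, both the minimum and the maximum of fixed.acidity fall outside the fences, which is consistent with the outliers noted above.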

Volatile.acidity in wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The smallest volatile acidity is 0.08 and the largest is 1.1. Above, I plot the main body of the volatile acidity values. The distribution of this variable is almost normal too. As with fixed acidity, I wonder whether this variable affects wine quality. Volatile acidity also has some outliers.

Citric.acid in wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The smallest citric.acid value is 0 and the largest is 1.66. Above, I plot the main body of the citric acid values. The distribution of this variable is almost normal too. However, the distribution has one large spike near 0.5. I wonder whether the wines in the spike, or the outliers, share some particular feature. Citric acid also has some outliers.

Residual.sugar in wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

I log-transformed the long-tailed data to better understand the distribution of residual sugar. The transformed distribution appears bimodal, with peaks at around 1.5 and again at around 8.0. I wonder how each peak relates to quality.
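To illustrate why a log10 transform helps with a long right tail, the sketch below compares how lopsided the range of residual.sugar is before and after the transform, using only the minimum, median and maximum printed above. It is a Python illustration rather than the R code behind the report, and `tail_ratio` is a hypothetical helper.

```python
import math

# Summary values of residual.sugar, copied from the output above.
minimum, median, maximum = 0.6, 5.2, 65.8

def tail_ratio(lo, mid, hi):
    """Length of the upper half of the range divided by the lower half."""
    return (hi - mid) / (mid - lo)

raw_ratio = tail_ratio(minimum, median, maximum)
log_ratio = tail_ratio(*(math.log10(v) for v in (minimum, median, maximum)))

# A ratio near 1 means the median sits near the middle of the range.
print(f"raw scale:   upper/lower = {raw_ratio:.2f}")
print(f"log10 scale: upper/lower = {log_ratio:.2f}")
```

On the raw scale the upper half of the range is more than ten times longer than the lower half. On the log10 scale the two halves are of similar length, so structure in the main body of the data is easier to see in a histogram.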

Chlorides in wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The smallest chlorides value is 0.009 and the largest is 0.346. Above, I plot the main body of the chlorides values. The distribution of this variable is almost normal. I wonder whether this variable affects wine quality, which would explain why wine quality is approximately normally distributed too. Chlorides also has some outliers.

Free.sulfur.dioxide in wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The smallest free.sulfur.dioxide value is 2.0 and the largest is 289.0. Above, I plot the main body of the free.sulfur.dioxide values. The distribution of this variable is almost normal. I wonder whether this variable affects wine quality, which would explain why wine quality is approximately normally distributed too. Free.sulfur.dioxide also has some outliers.

Total.sulfur.dioxide in wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The smallest total.sulfur.dioxide value is 9.0 and the largest is 440.0. Above, I plot the main body of the total.sulfur.dioxide values. The distribution of this variable is almost normal. I wonder whether this variable affects wine quality, which would explain why wine quality is approximately normally distributed too. Total.sulfur.dioxide also has some outliers.

Density of wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Most wines have a density between 0.991 g/cm^3 and 0.997 g/cm^3, with a median of 0.9937 g/cm^3 and a mean of 0.9940 g/cm^3. Density has a few outliers.

pH of wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Most wines have a pH between 2.85 and 3.6, with a median of 3.18 and a mean of 3.188. The distribution of this variable is almost normal. I wonder whether this variable affects wine quality, which would explain why wine quality is approximately normally distributed too. pH also has some outliers.

Sulphates of wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

I log-transformed the long-tailed data to better understand the distribution of sulphates. The transformed distribution appears normal, but it has one spike near 0.5. What is this peak? The log10-transformed sulphates also have some outliers.

Alcohol of wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Most wines have an alcohol content between 8.5% and 13%, with a median of 10.4% and a mean of 10.51%. The distribution is slightly right-skewed. I wonder why the peak of the distribution lies below 10%.

Create a quality category

## 
##  (0,4]  (4,7] (7,10] 
##    183   4535    180
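The three intervals above split quality into low, middle and high groups. The sketch below shows the same binning in Python (the report itself used R). The labels bad, normal and good are the ones used later in the report, and mapping them to these intervals in this order is an assumption. `quality_level` is a hypothetical helper name.

```python
def quality_level(score):
    """Map a quality score to the intervals (0,4], (4,7], (7,10]."""
    if not 0 < score <= 10:
        raise ValueError(f"quality score out of range: {score}")
    if score <= 4:
        return "bad"      # (0,4]
    if score <= 7:
        return "normal"   # (4,7]
    return "good"         # (7,10]

# Quality in this dataset ranges from 3 to 9 (see the summary above).
levels = {q: quality_level(q) for q in range(3, 10)}
print(levels)
```

The intervals are closed on the right, so a score of 4 is still "bad" and a score of 7 is still "normal".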

Check the distributions by quality category (1)

None of these variables shows a large difference in distribution across the quality categories. I wonder whether residual.sugar works best in combination with other variables in the low or high quality category.

Check the distributions by quality category (2)

None of these variables shows a large difference in distribution across the quality categories.

Check the distributions by quality category (3)

Alcohol does show a different distribution across the quality categories: wines with a higher alcohol content tend to be of better quality. I wonder whether a high alcohol content improves quality.

Univariate Analysis

What is the structure of your dataset?

There are 4,898 wines in the dataset with 13 variables: a row index (X) plus fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality. All variables are numeric.

Other observation:

  • The median quality is 6.
  • Most wines have a quality between 5 and 7.

What is/are the main feature(s) of interest in your dataset?

The main feature of this dataset is the quality of the wine. We need to know how the other variables affect this value. I’d like to determine which features are best for predicting the quality of white wine. I suspect alcohol and some combination of the other variables can be used to build a predictive model for wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Alcohol likely contributes to the quality of wine. I think all the other variables support my investigation into quality, because the taste of a wine is determined by the combination of its ingredients.

Did you create any new variables from existing variables in the dataset?

I created a variable for the quality category. Using this variable, I checked the distribution of each variable to see its effect on quality.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed the right-skewed residual.sugar and volatile.acidity. The transformed distribution of residual.sugar appears bimodal, with peaks at around 1.5 and again at around 8.0.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
##                        sulphates     alcohol      quality
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831
## volatile.acidity     -0.03572815  0.06771794 -0.194722969
## citric.acid           0.06233094 -0.07572873 -0.009209091
## residual.sugar       -0.02666437 -0.45063122 -0.097576829
## chlorides             0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218
## density               0.07449315 -0.78013762 -0.307123313
## pH                    0.15595150  0.12143210  0.099427246
## sulphates             1.00000000 -0.01743277  0.053677877
## alcohol              -0.01743277  1.00000000  0.435574715
## quality               0.05367788  0.43557472  1.000000000

Several of the variables above correlate with each other. Density in particular correlates closely with several variables: residual.sugar (0.84), alcohol (-0.78) and total.sulfur.dioxide (0.53).

Most variables correlate only weakly with quality. The three strongest correlations are alcohol (0.44), density (-0.31) and chlorides (-0.21).
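The ranking above can be checked directly from the quality column of the correlation matrix. The sketch below is a Python illustration (the report itself used R) that sorts the variables by the absolute value of their correlation with quality. The numbers are copied from the matrix; the variable names `cor_with_quality`, `ranked` and `top3` are illustrative.

```python
# Correlation of each variable with quality, copied from the matrix above.
cor_with_quality = {
    "fixed.acidity": -0.113662831,
    "volatile.acidity": -0.194722969,
    "citric.acid": -0.009209091,
    "residual.sugar": -0.097576829,
    "chlorides": -0.209934411,
    "free.sulfur.dioxide": 0.008158067,
    "total.sulfur.dioxide": -0.174737218,
    "density": -0.307123313,
    "pH": 0.099427246,
    "sulphates": 0.053677877,
    "alcohol": 0.435574715,
}

# Rank by strength of the linear relationship, ignoring its sign.
ranked = sorted(cor_with_quality.items(), key=lambda kv: abs(kv[1]), reverse=True)
top3 = [name for name, _ in ranked[:3]]

for name, r in ranked:
    print(f"{name:22s} {r:+.3f}")
```

Sorting by absolute value matters here because most of the correlations with quality are negative.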

Adding jitter and transparency and changing the plot limits lets us see the positive correlation between total.sulfur.dioxide and free.sulfur.dioxide. This relationship arises because total.sulfur.dioxide includes free.sulfur.dioxide.

Adding jitter and transparency and changing the plot limits lets us see the positive correlation between alcohol and quality.

Adding jitter and transparency and changing the plot limits lets us see the negative correlation between density and quality.

Adding jitter and transparency and changing the plot limits lets us see the negative correlation between chlorides and quality.

Alcohol correlates closely with density. This is to be expected, since a material property such as density is determined by the composition of the liquid. I think the remaining scatter comes from the measurement method and from differences in the test environment. Therefore, I think we should use only one of alcohol and density to predict wine quality.
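The idea of keeping only one of two closely related predictors can be made concrete by flagging pairs whose correlation exceeds a cut-off. The sketch below is a Python illustration (the report itself used R) using a few pairs copied from the correlation matrix above. The cut-off of 0.7 is an assumption chosen for illustration, not a value taken from the report.

```python
# Selected pairwise correlations, copied from the matrix above.
pairs = {
    ("residual.sugar", "density"): 0.83896645,
    ("alcohol", "density"): -0.78013762,
    ("free.sulfur.dioxide", "total.sulfur.dioxide"): 0.615500965,
    ("total.sulfur.dioxide", "density"): 0.52988132,
    ("residual.sugar", "alcohol"): -0.45063122,
}

THRESHOLD = 0.7  # assumed cut-off for "closely correlated"

# Keep the pairs whose absolute correlation reaches the cut-off.
collinear = [pair for pair, r in pairs.items() if abs(r) >= THRESHOLD]

for a, b in collinear:
    print(f"{a} and {b}: |r| = {abs(pairs[(a, b)]):.2f}")
```

With this cut-off, density is flagged together with both residual.sugar and alcohol, which supports treating density as largely redundant once those two are in a model.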

I plotted the density distribution of volatile.acidity for each quality level (bad, normal, good). I transformed the long-tailed data to better understand the distribution of volatile.acidity. From this plot, I can see small differences in the distribution pattern between the quality levels.

I plotted the density distribution of residual.sugar for each quality level (bad, normal, good). I transformed the long-tailed data to better understand the distribution of residual.sugar. From this plot, I can see small differences in the distribution pattern between the quality levels. The distributions have two regions, low and high residual.sugar. Within the low region, better wines tend to have more residual sugar; within the high region, better wines tend to have less.

I plotted the density distribution of alcohol for each quality level (bad, normal, good). From this plot, I can see that many of the better wines have a high alcohol content. However, I cannot see a difference between normal and bad wines. Which variable determines that difference? I will check the density distributions of the other variables.

I plotted the density distribution of free.sulfur.dioxide for each quality level (bad, normal, good). I transformed the long-tailed data to better understand the distribution of free.sulfur.dioxide. With this variable, I can see a difference in the distribution pattern between bad wines and the better levels.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Alcohol is the key predictor of wine quality because, of all the variables, it correlates most closely with quality. Some variables correlate with alcohol, so I leave them out of the set of predictors for wine quality. Given that result, I think I need volatile.acidity. Density looks like a good predictor by its correlation, but I do not include it among the predictors, because it is a material property that correlates with several other variables that are ingredients of the wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The positive correlation between total.sulfur.dioxide and free.sulfur.dioxide is expected, because total.sulfur.dioxide includes free.sulfur.dioxide. Good wines are concentrated in the region of high alcohol content, but alcohol alone could not distinguish normal wines from bad ones. The variable that does show that difference is free.sulfur.dioxide. I also looked at the density distribution of log-transformed residual.sugar for each quality level (bad, normal, good). At every level the distribution has the same bimodal pattern, but the levels differ near each of the two peaks: in the low residual.sugar region better wines have more residual sugar, and in the high region the opposite holds.

What was the strongest relationship you found?

The strongest relationship is between residual.sugar and density, with a correlation of 0.839.

Multivariate Plots Section

I’d like to find out which variables determine the better wines. In the Bivariate Plots Section, I learned that alcohol is a good predictor.

I plotted a scatter plot of alcohol against log-transformed free.sulfur.dioxide, split by quality level. From this graph, I can see the effect of free.sulfur.dioxide at each quality level. Higher free.sulfur.dioxide goes with higher quality, but the effect is small at low alcohol.

I plotted box plots of alcohol and log-transformed free.sulfur.dioxide, split by quality level. From this graph, I can see the variance of free.sulfur.dioxide at each quality level. Within each alcohol bucket, the variance of log-transformed free.sulfur.dioxide tends to decrease as quality increases.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + log10(free.sulfur.dioxide), 
##     data = wine)
## m3: lm(formula = quality ~ alcohol + log10(free.sulfur.dioxide) + 
##     log10(volatile.acidity), data = wine)
## m4: lm(formula = quality ~ alcohol + log10(free.sulfur.dioxide) + 
##     log10(volatile.acidity) + log10(residual.sugar), data = wine)
## m5: lm(formula = quality ~ alcohol + log10(free.sulfur.dioxide) + 
##     log10(volatile.acidity) + log10(residual.sugar) + chlorides + 
##     total.sulfur.dioxide + density + pH + sulphates, data = wine)
## m_base: lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = wine)
## 
## ==================================================================================================================
##                                    m1            m2            m3            m4            m5          m_base     
## ------------------------------------------------------------------------------------------------------------------
##   (Intercept)                     2.582***      1.080***      0.360**      -0.030        44.163***    150.193***  
##                                  (0.098)       (0.134)       (0.136)       (0.142)       (9.594)      (18.804)    
##   alcohol                         0.313***      0.347***      0.354***      0.385***      0.292***      0.193***  
##                                  (0.009)       (0.009)       (0.009)       (0.010)       (0.017)       (0.024)    
##   log10(free.sulfur.dioxide)                    0.771***      0.704***      0.584***      0.744***                
##                                                (0.048)       (0.047)       (0.048)       (0.059)                  
##   log10(volatile.acidity)                                    -1.280***     -1.404***     -1.257***                
##                                                              (0.074)       (0.075)       (0.077)                  
##   log10(residual.sugar)                                                     0.272***      0.499***                
##                                                                            (0.031)       (0.050)                  
##   chlorides                                                                              -0.894        -0.247     
##                                                                                          (0.527)       (0.547)    
##   total.sulfur.dioxide                                                                   -0.002***     -0.000     
##                                                                                          (0.000)       (0.000)    
##   density                                                                               -44.587***   -150.284***  
##                                                                                          (9.582)      (19.075)    
##   pH                                                                                      0.284***      0.686***  
##                                                                                          (0.074)       (0.105)    
##   sulphates                                                                               0.506***      0.631***  
##                                                                                          (0.097)       (0.100)    
##   fixed.acidity                                                                                         0.066**   
##                                                                                                        (0.021)    
##   volatile.acidity                                                                                     -1.863***  
##                                                                                                        (0.114)    
##   citric.acid                                                                                           0.022     
##                                                                                                        (0.096)    
##   residual.sugar                                                                                        0.081***  
##                                                                                                        (0.008)    
##   free.sulfur.dioxide                                                                                   0.004***  
##                                                                                                        (0.001)    
## ------------------------------------------------------------------------------------------------------------------
##   R-squared                       0.190         0.230         0.275         0.286         0.301         0.282     
##   adj. R-squared                  0.190         0.230         0.275         0.286         0.299         0.280     
##   sigma                           0.797         0.777         0.754         0.748         0.741         0.751     
##   F                            1146.395       732.923       618.802       491.077       233.466       174.344     
##   p                               0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood              -5839.391     -5713.108     -5567.036     -5528.056     -5478.898     -5543.740     
##   Deviance                     3112.257      2955.840      2784.692      2740.720      2686.255      2758.329     
##   AIC                         11684.782     11434.216     11144.072     11068.111     10979.797     11113.480     
##   BIC                         11704.272     11460.202     11176.555     11107.091     11051.259     11197.936     
##   N                            4898          4898          4898          4898          4898          4898         
## ==================================================================================================================

The linear model with log-transformed free.sulfur.dioxide, volatile.acidity and residual.sugar (m5) accounts for 30.1% of the variance in wine quality, compared to 28.2% for the model without the transformations (m_base). I left fixed.acidity and citric.acid out of m5 because they have little correlation with wine quality.
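The AIC row of the table can be reproduced from the log-likelihood row, which also makes the comparison between models explicit. The sketch below is a Python illustration (the report itself used R). It uses AIC = -2·logLik + 2·k, where k is assumed to be the number of regression coefficients plus one for the residual variance. That assumption reproduces the table's values to within rounding.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return -2.0 * log_likelihood + 2.0 * n_params

# (log-likelihood, number of coefficients including the intercept),
# both read off the regression table above.
models = {
    "m1": (-5839.391, 2),
    "m2": (-5713.108, 3),
    "m3": (-5567.036, 4),
    "m4": (-5528.056, 5),
    "m5": (-5478.898, 10),
    "m_base": (-5543.740, 12),
}

# One extra parameter per model for the residual variance (assumed convention).
aics = {name: aic(ll, k + 1) for name, (ll, k) in models.items()}
best = min(aics, key=aics.get)

for name, value in aics.items():
    print(f"{name:7s} AIC = {value:.3f}")
print("lowest AIC:", best)
```

Because AIC penalizes each extra parameter, m5 having a lower AIC than m_base means it fits better while also using fewer coefficients.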

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In the scatter plot of alcohol against log-transformed free.sulfur.dioxide, I could see that quality is higher at high free.sulfur.dioxide, but the effect is small at low alcohol.

Were there any interesting or surprising interactions between features?

In the box plots of alcohol and log-transformed free.sulfur.dioxide, the variance of log-transformed free.sulfur.dioxide tends to decrease as wine quality increases.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model that starts from quality and alcohol and then adds some log-transformed variables (free.sulfur.dioxide, volatile.acidity and residual.sugar) and some untransformed variables (chlorides, total.sulfur.dioxide, density, pH and sulphates, as listed in the formula for m5).

Final Plots and Summary

Plot One

Description One

The distribution of wine quality appears to be normal. This is a natural distribution: nobody wants to make bad wine, but it is hard to make good wine.

Plot Two

Description Two

Wine quality is mainly determined by alcohol and free.sulfur.dioxide. Better wines tend to have a higher alcohol content, and bad wines tend to have low free.sulfur.dioxide. Free sulfur dioxide is added to protect wine from oxidation. Perhaps, therefore, this indicates that many wines need protection against oxidation to reach better quality.

Plot Three

Description Three

The plot indicates that, in the linear model, wine quality is mainly predicted by alcohol and log-transformed free.sulfur.dioxide. Between low and high values of the transformed free.sulfur.dioxide, predicted quality changes by about one.
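To see how a change in free.sulfur.dioxide translates into predicted quality, the sketch below evaluates a two-predictor linear model in Python (the report itself used R). It assumes the coefficients of model m2 from the regression table; the report does not say which model Plot Three is based on. The alcohol value and the two free.sulfur.dioxide values are illustrative choices within the observed ranges, and `predict_quality` is a hypothetical helper.

```python
import math

# Coefficients of model m2, copied from the regression table (assumed model).
INTERCEPT, B_ALCOHOL, B_LOG_FSD = 1.080, 0.347, 0.771

def predict_quality(alcohol, free_sulfur_dioxide):
    """Predicted quality from alcohol (%) and free.sulfur.dioxide."""
    return (INTERCEPT
            + B_ALCOHOL * alcohol
            + B_LOG_FSD * math.log10(free_sulfur_dioxide))

alcohol = 10.4  # median alcohol from the summary
low = predict_quality(alcohol, 10.0)    # illustrative low value
high = predict_quality(alcohol, 100.0)  # illustrative high value
shift = high - low

print(f"predicted quality at 10:  {low:.2f}")
print(f"predicted quality at 100: {high:.2f}")
print(f"shift for a tenfold increase: {shift:.3f}")
```

Under these assumptions, a tenfold increase in free.sulfur.dioxide at fixed alcohol shifts predicted quality by the log10 coefficient itself, a little under one quality point.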

Reflection

The white wine data set contains information on almost 4,900 white wines across twelve variables. I started by understanding the individual variables in the data set, and then I explored questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wine across many variables and created a linear model to predict wine quality.

There was a clear trend between alcohol and quality: wines with a higher alcohol content tended to be of better quality. However, alcohol alone, with a correlation coefficient of 0.436, could not explain wine quality; predicting it required several log-transformed and untransformed variables as well. I therefore checked the density distribution of alcohol at each quality level and confirmed that this distribution only distinguishes the better wines. I then looked for a variable whose distribution distinguishes the bad wines, and found it in log-transformed free.sulfur.dioxide. I also log-transformed volatile.acidity and residual.sugar for better understanding. To build the linear model, I used alcohol, the three log-transformed variables and the untransformed variables of model m5 (chlorides, total.sulfur.dioxide, density, pH, sulphates), which have some correlation with wine quality. The model accounts for 30.1% of the variance in wine quality. I feel this R-squared value is too small to predict wine quality well. Perhaps wine quality has non-linear relationships with some variables that a linear model, with its limited descriptive capacity, cannot capture.

Some limitations of this model come from the source of the data. Given that the white wine data date from 2009 or earlier, the model may misjudge white wines on the market today, because evaluation methods and evaluators may have changed. To investigate this data further, I would be interested in testing a non-linear model to predict wine quality and in improving prediction accuracy. I’m also interested in how the relationships between wine quality and the other variables have changed in a more recent dataset.